remove unnecessary call to F.pad
#10620
Conversation
Hi @bm-synth. Thanks for your contribution. Can you share some figures on the memory and performance improvements?
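A minimal sketch (not from the thread) of how peak-memory and wall-clock figures like the ones reported below can be collected; the helper name `report_memory_and_time` is illustrative:

```python
import time

import torch


def report_memory_and_time(fn, *args, label: str = ""):
    # Reset the peak-memory counter so max_memory_allocated reflects only this call
    torch.cuda.reset_peak_memory_stats()
    start = time.time()
    result = fn(*args)
    torch.cuda.synchronize()  # wait for queued CUDA kernels before reading the clock
    elapsed = time.time() - start
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    print(f"{label} time: {elapsed:.3f} secs, peak memory: {peak_gb:.3f} GBs")
    return result
```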
Hi @hlky. Running the following:

```python
import time

import torch
import torch.nn as nn
import torch.nn.functional as F

from diffusers.models.autoencoders.autoencoder_kl_cogvideox import CogVideoXCausalConv3d

torch.manual_seed(42)


def train(model: nn.Module, video_input: torch.Tensor):
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    model.train()
    start_train = time.time()
    for iteration in range(100):  # Simulate 100 training iterations
        optimizer.zero_grad()
        output = model(video_input)[0]
        loss = F.mse_loss(output, output + iteration)  # sum iteration to fake different grads per iteration
        loss.backward()
        optimizer.step()
    torch.cuda.synchronize()
    train_time = time.time() - start_train
    print("train_time", train_time, "secs")
    return output.to("cpu")


def eval(model: nn.Module, video_input: torch.Tensor):
    model.eval()
    start_train = time.time()
    with torch.no_grad():
        for _ in range(300):  # Simulate 300 inference iterations
            model(video_input)
    torch.cuda.synchronize()
    eval_time = time.time() - start_train
    print("eval_time", eval_time, "secs")
```

Calling with that input shape on the main branch gives:

```
$ PYTHONPATH=./diffusers_main/src/ python test_autoencoder.py
input size: 0.498046875 GBs
eval_time 33.06385564804077 secs
train_time 34.33984375 secs
Max memory 22.18018913269043 GBs
```

and calling this PR branch gives:

```
$ PYTHONPATH=./diffusers_PR/src/ python test_autoencoder.py
input size: 0.498046875 GBs
eval_time 31.588099241256714 secs
train_time 34.1251916885376 secs
Max memory 22.17398452758789 GBs
```

On the second shape, the main branch gives:

```
$ PYTHONPATH=./diffusers_main/src/ python test_autoencoder.py
input size: 0.43773651123046875 GBs
eval_time 17.759469032287598 secs
train_time 96.50320744514465 secs
Max memory 16.353439331054688 GBs
```

and this PR:

```
$ PYTHONPATH=./diffusers_PR/src/ python test_autoencoder.py
input size: 0.43773651123046875 GBs
eval_time 16.8880774974823 secs
train_time 96.04004764556885 secs
Max memory 16.34803009033203 GBs
```

I'll try to test more dimensions.
@bm-synth Great, thanks. Would it also be possible to verify numerical accuracy between the two versions? For a change like this we would expect between 0 and 1e-6 difference.
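A minimal sketch (not from the thread) of the kind of tolerance check being asked for; `out_main` and `out_pr` stand for outputs produced from the same input on the two branches:

```python
import torch


def check_close(out_main: torch.Tensor, out_pr: torch.Tensor) -> None:
    # Report the largest elementwise deviation between the two branches
    max_abs_diff = (out_pr - out_main).abs().max().item()
    print("max abs diff:", max_abs_diff)
    # rtol=0 so only the absolute tolerance matters; 1e-6 matches the expectation stated above
    torch.testing.assert_close(out_pr, out_main, rtol=0, atol=1e-6)
```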
@hlky I updated the code above to fix a seed (`torch.manual_seed(42)`) and added the following comparison of the saved outputs:

```python
if __name__ == '__main__':
    output_main: torch.Tensor = torch.load("output_main.pt")
    output_PR: torch.Tensor = torch.load("output_PR.pt")
    print("mean:", output_main.mean().item(), "vs", output_PR.mean().item())
    print("std:", output_main.std().item(), "vs", output_PR.std().item())
    print("max abs diff:", (output_PR - output_main).abs().max().item())
    assert torch.allclose(output_main, output_PR)
```

output:

@hlky ping?
Hi @bm-synth. We need to verify the accuracy of the PR version against the current implementation. Code:

```python
from typing import Optional, Tuple, Union

import torch
import torch.nn as nn
import torch.nn.functional as F

from diffusers.models.autoencoders.autoencoder_kl_cogvideox import CogVideoXCausalConv3d


class CogVideoXSafeConv3d_PR(nn.Conv3d):
    r"""
    A 3D convolution layer that splits the input tensor into smaller parts to avoid OOM in CogVideoX Model.
    """

    def forward(self, input: torch.Tensor) -> torch.Tensor:
        memory_count = torch.prod(torch.tensor(input.shape)) * 2 / 1024**3

        # Set to 2GB, suitable for CuDNN
        if memory_count > 2:
            kernel_size = self.kernel_size[0]
            part_num = int(memory_count / 2) + 1
            input_chunks = torch.chunk(input, part_num, dim=2)
            if kernel_size > 1:
                input_chunks = [input_chunks[0]] + [
                    torch.cat((input_chunks[i - 1][:, :, -kernel_size + 1 :], input_chunks[i]), dim=2)
                    for i in range(1, len(input_chunks))
                ]

            output_chunks = []
            for input_chunk in input_chunks:
                output_chunks.append(super().forward(input_chunk))
            output = torch.cat(output_chunks, dim=2)
            return output
        else:
            return super().forward(input)


class CogVideoXCausalConv3d_PR(nn.Module):
    r"""A 3D causal convolution layer that pads the input tensor to ensure causality in CogVideoX Model.

    Args:
        in_channels (`int`): Number of channels in the input tensor.
        out_channels (`int`): Number of output channels produced by the convolution.
        kernel_size (`int` or `Tuple[int, int, int]`): Kernel size of the convolutional kernel.
        stride (`int`, defaults to `1`): Stride of the convolution.
        dilation (`int`, defaults to `1`): Dilation rate of the convolution.
        pad_mode (`str`, defaults to `"constant"`): Padding mode.
    """

    def __init__(
        self,
        in_channels: int,
        out_channels: int,
        kernel_size: Union[int, Tuple[int, int, int]],
        stride: int = 1,
        dilation: int = 1,
        pad_mode: str = "constant",
    ):
        super().__init__()

        if isinstance(kernel_size, int):
            kernel_size = (kernel_size,) * 3

        time_kernel_size, height_kernel_size, width_kernel_size = kernel_size

        # TODO(aryan): configure calculation based on stride and dilation in the future.
        # Since CogVideoX does not use it, it is currently tailored to "just work" with Mochi
        time_pad = time_kernel_size - 1
        height_pad = (height_kernel_size - 1) // 2
        width_pad = (width_kernel_size - 1) // 2

        self.pad_mode = pad_mode
        self.height_pad = height_pad
        self.width_pad = width_pad
        self.time_pad = time_pad
        self.time_causal_padding = (width_pad, width_pad, height_pad, height_pad, time_pad, 0)
        self.const_padding_conv3d = (0, self.width_pad, self.height_pad)

        self.temporal_dim = 2
        self.time_kernel_size = time_kernel_size

        stride = stride if isinstance(stride, tuple) else (stride, 1, 1)
        dilation = (dilation, 1, 1)
        self.conv = CogVideoXSafeConv3d_PR(
            in_channels=in_channels,
            out_channels=out_channels,
            kernel_size=kernel_size,
            stride=stride,
            dilation=dilation,
            padding=0 if self.pad_mode == "replicate" else self.const_padding_conv3d,
            padding_mode="zeros",
        )

    def fake_context_parallel_forward(
        self, inputs: torch.Tensor, conv_cache: Optional[torch.Tensor] = None
    ) -> torch.Tensor:
        if self.pad_mode == "replicate":
            inputs = F.pad(inputs, self.time_causal_padding, mode="replicate")
        else:
            kernel_size = self.time_kernel_size
            if kernel_size > 1:
                cached_inputs = [conv_cache] if conv_cache is not None else [inputs[:, :, :1]] * (kernel_size - 1)
                inputs = torch.cat(cached_inputs + [inputs], dim=2)
        return inputs

    def forward(self, inputs: torch.Tensor, conv_cache: Optional[torch.Tensor] = None) -> torch.Tensor:
        inputs = self.fake_context_parallel_forward(inputs, conv_cache)

        if self.pad_mode == "replicate":
            conv_cache = None
        else:
            conv_cache = inputs[:, :, -self.time_kernel_size + 1 :].clone()

        output = self.conv(inputs)
        return output, conv_cache


model = CogVideoXCausalConv3d(in_channels=128, out_channels=512, kernel_size=3).eval()
with torch.no_grad():
    output = model(torch.randn([1, 128, 8, 544, 960], generator=torch.Generator().manual_seed(0)))[0]
with torch.no_grad():
    output_2 = model(torch.randn([1, 128, 8, 544, 960], generator=torch.Generator().manual_seed(0)))[0]
torch.testing.assert_close(output, output_2)
print((output - output_2).abs().max())

model_pr = CogVideoXCausalConv3d_PR(in_channels=128, out_channels=512, kernel_size=3).eval()
with torch.no_grad():
    output_pr = model_pr(torch.randn([1, 128, 8, 544, 960], generator=torch.Generator().manual_seed(0)))[0]
torch.testing.assert_close(output, output_pr)
```

The last assert fails with:

```
Mismatched elements: 2139073042 / 2139095040 (100.0%)
Greatest absolute difference: 5.3313703536987305 at index (0, 421, 5, 286, 946) (up to 1e-05 allowed)
Greatest relative difference: 2893981952.0 at index (0, 348, 3, 142, 869) (up to 1.3e-06 allowed)
```

and `print((output - output_pr).abs().max())` gives `tensor(5.3314)`.

The first check here, comparing `output` and `output_2` from the unmodified module, passes. Also note the code will run on CPU in float32; this is to avoid other sources of non-determinism when testing, although it will use a large amount of memory. Generally we choose the smallest possible shape and model configuration for tests.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@hlky your code has an issue: the random state before initializing each model is different, so the two modules start from different weights. Setting the seed before each initialization makes the comparison pass:

```python
if __name__ == '__main__':
    # assumes `import diffusers` and the class definitions from the previous comment
    print(f"diffusers version: {diffusers.__version__}")

    torch.manual_seed(42)  #### <---- added this
    model = CogVideoXCausalConv3d(in_channels=128, out_channels=512, kernel_size=3).eval()
    with torch.no_grad():
        output = model(torch.randn([1, 128, 8, 544, 960], generator=torch.Generator().manual_seed(0)))[0]
    with torch.no_grad():
        output_2 = model(torch.randn([1, 128, 8, 544, 960], generator=torch.Generator().manual_seed(0)))[0]
    torch.testing.assert_close(output, output_2)
    print("max abs difference (output, output_2):", (output - output_2).abs().max().item())
    print("number of different elements (output, output_2):", (output != output_2).sum().item())

    torch.manual_seed(42)  ##### <---- added this
    model_pr = CogVideoXCausalConv3d_PR(in_channels=128, out_channels=512, kernel_size=3).eval()
    with torch.no_grad():
        output_pr = model_pr(torch.randn([1, 128, 8, 544, 960], generator=torch.Generator().manual_seed(0)))[0]
    torch.testing.assert_close(output, output_pr)
    print("max abs difference (output, output_pr):", (output - output_pr).abs().max().item())
    print("number of different elements (output, output_pr):", (output != output_pr).sum().item())
```

output:
I'm still concerned by the reproducibility, especially as it's with float32 on CPU. Using
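A seed-independent way to rule out initialization differences, shown here only as a sketch and not something proposed in the thread, is to copy the weights of one module into the other before comparing outputs; the smaller input shape is illustrative:

```python
import torch

from diffusers.models.autoencoders.autoencoder_kl_cogvideox import CogVideoXCausalConv3d

# CogVideoXCausalConv3d_PR is the modified class defined earlier in this thread
model_main = CogVideoXCausalConv3d(in_channels=128, out_channels=512, kernel_size=3).eval()
model_pr = CogVideoXCausalConv3d_PR(in_channels=128, out_channels=512, kernel_size=3).eval()

# Copy weights so both modules are numerically identical regardless of RNG state
model_pr.load_state_dict(model_main.state_dict())

x = torch.randn([1, 128, 8, 64, 64], generator=torch.Generator().manual_seed(0))
with torch.no_grad():
    out_main = model_main(x)[0]
    out_pr = model_pr(x)[0]
torch.testing.assert_close(out_main, out_pr)
```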
@hlky (cc @yiyixuxu) your reproducibility concerns are not related to this PR. They are an issue in the production code; this PR gives the same behaviour as production. If you run the code below, where you only instantiate and call your current CogVideoX module twice, without setting the random seed, it already fails. See below:

```python
if __name__ == '__main__':
    print(f"diffusers version: {diffusers.__version__}")

    torch.manual_seed(42)
    model = CogVideoXCausalConv3d(in_channels=128, out_channels=512, kernel_size=3).eval()
    with torch.no_grad():
        output = model(torch.randn([1, 128, 8, 544, 960], generator=torch.Generator().manual_seed(0)))[0]
    with torch.no_grad():
        output_2 = model(torch.randn([1, 128, 8, 544, 960], generator=torch.Generator().manual_seed(0)))[0]
    torch.testing.assert_close(output, output_2)
    print("max abs difference (output, output_2):", (output - output_2).abs().max().item())
    print("number of different elements (output, output_2):", (output != output_2).sum().item())

    # torch.manual_seed(42)  ##### <--- THE SECOND CHECK ONLY PASSES IF YOU UNCOMMENT THIS!
    model_main = CogVideoXCausalConv3d(in_channels=128, out_channels=512, kernel_size=3).eval()
    with torch.no_grad():
        output_main = model_main(torch.randn([1, 128, 8, 544, 960], generator=torch.Generator().manual_seed(0)))[0]
    torch.testing.assert_close(output, output_main)
    print("max abs difference (output, output_main):", (output - output_main).abs().max().item())
    print("number of different elements (output, output_main):", (output != output_main).sum().item())
```
We get the same
@hlky what I meant is:

What am I missing here?
Can you run more benchmarks? From #10620 (comment) the benefit is not clear, and it looks like the figures for memory only cover training. Follow our standard benchmarking methodology (lines 52 to 58 in 1357931; https://github.com/huggingface/diffusers/tree/main/benchmarks).

#10620 (comment) also mentions an improvement with torch.compile; can we demonstrate this? What specifically is the optimization it allows in torch.compile?
@hlky running this code, which calls `CogVideoXCausalConv3d` compiled with `torch.compile`:

```python
def benchmark_fn(f, *args, **kwargs):
    import torch.utils.benchmark as benchmark

    t0 = benchmark.Timer(
        stmt="f(*args, **kwargs)",
        globals={"args": args, "kwargs": kwargs, "f": f},
        num_threads=torch.get_num_threads(),
    )
    return f"{(t0.blocked_autorange().mean):.5f}"


def train(model, inputs, targets, criterion, optimizer):
    model.train()
    optimizer.zero_grad()
    outputs = model(inputs)
    if isinstance(outputs, tuple):
        outputs = outputs[0]  # only use the first tensor as output
    loss = criterion(outputs, targets)
    loss.backward()
    optimizer.step()
    return loss.item()


def inference(model, inputs):
    model.eval()
    with torch.no_grad():
        outputs = model(inputs)
    return outputs


if __name__ == '__main__':
    print(f"diffusers version: {diffusers.__version__}")
    to_kw = dict(device="cuda", dtype=torch.bfloat16)

    torch.manual_seed(42)
    model_main = CogVideoXCausalConv3d(in_channels=128, out_channels=512, kernel_size=3).to(**to_kw)
    model_main = torch.compile(model_main, mode="max-autotune-no-cudagraphs")
    optimizer_main = torch.optim.Adam(model_main.parameters(), lr=1e-3)

    torch.manual_seed(42)
    model_pr = CogVideoXCausalConv3d_PR(in_channels=128, out_channels=512, kernel_size=3).to(**to_kw)
    model_pr = torch.compile(model_pr, mode="max-autotune-no-cudagraphs")
    optimizer_pr = torch.optim.Adam(model_pr.parameters(), lr=1e-3)

    criterion = nn.MSELoss()

    # test 5 very different shape inputs
    for shape in [
        (1, 128, 8, 544, 960),
        (3, 128, 22, 12, 123),
        (5, 128, 2, 4, 324),
        (2, 128, 32, 128, 128),
        (9, 128, 64, 24, 123),
    ]:
        input = torch.randn(shape, generator=torch.Generator().manual_seed(0)).to(**to_kw)
        output_shape = (shape[0], 512, shape[2], shape[3], shape[4])
        target = torch.randn(output_shape, generator=torch.Generator().manual_seed(0)).to(**to_kw)

        # few iterations to warm up the model and compiler
        for _ in range(5):
            benchmark_fn(train, model_main, input, target, criterion, optimizer_main)
            benchmark_fn(train, model_pr, input, target, criterion, optimizer_pr)

        # benchmark training
        runtime = benchmark_fn(train, model_main, input, target, criterion, optimizer_main)
        print(f"shape {shape}, main runtime:", runtime)
        runtime = benchmark_fn(train, model_pr, input, target, criterion, optimizer_pr)
        print(f"shape {shape}, PR runtime:", runtime)
        print()
```

output:

so you're just avoiding the extra `inputs = F.pad(inputs, padding_2d, mode="constant", value=0)` and having the Conv3d CUDA kernel handle that padding instead. Plus the cleaner code.
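To make that claim concrete, a small self-contained sketch (not from the thread) comparing an explicit `F.pad` followed by an unpadded `Conv3d` against a `Conv3d` configured with the same spatial zero-padding; the channel counts and input shape are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
height_pad = width_pad = 1  # (kernel_size - 1) // 2 for a 3x3x3 kernel

# Variant 1: pad explicitly, then convolve with padding=0 (the main-branch behaviour spatially)
conv_unpadded = nn.Conv3d(4, 8, kernel_size=3, padding=0)
# Variant 2: let Conv3d zero-pad height/width itself (the PR behaviour)
conv_padded = nn.Conv3d(4, 8, kernel_size=3, padding=(0, height_pad, width_pad))
conv_padded.load_state_dict(conv_unpadded.state_dict())  # identical weights

x = torch.randn(1, 4, 5, 16, 16)
x_time = F.pad(x, (0, 0, 0, 0, 2, 0))  # causal padding on the time axis, shared by both variants

padding_2d = (width_pad, width_pad, height_pad, height_pad)
out_fpad = conv_unpadded(F.pad(x_time, padding_2d, mode="constant", value=0))
out_builtin = conv_padded(x_time)

torch.testing.assert_close(out_fpad, out_builtin)
print((out_fpad - out_builtin).abs().max())
```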
I'll cc my colleague @ic-synth, who first flagged this, and let him answer with more accuracy. Either way, it's a cleaner way to write a product over all shape dimensions.
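As an illustration of that last point, the shape-agnostic product used in `CogVideoXSafeConv3d_PR` above versus spelling out every dimension; the factor of 2 assumes 2-byte elements, as in the existing code:

```python
import torch

shape = (1, 128, 8, 544, 960)  # the input shape used in the benchmarks above

# Explicit per-dimension product (what the rewrite avoids)
memory_count_explicit = shape[0] * shape[1] * shape[2] * shape[3] * shape[4] * 2 / 1024**3
# Shape-agnostic product over all dimensions, as in CogVideoXSafeConv3d_PR
memory_count_generic = torch.prod(torch.tensor(shape)).item() * 2 / 1024**3

print(memory_count_explicit, memory_count_generic)  # both ~0.996 GBs
```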
@bm-synth Can you benchmark inference? Training this autoencoder is not going to be a common use case. The timing is also worse in some cases and the difference is very minimal in others, to the point that it looks like random variation; the biggest difference is less than 1%. We can also include memory in the benchmarking :)
@hlky cc @yiyixuxu here's the updated benchmark script that includes inference:

```python
def benchmark_fn(f, *args, **kwargs):
    import torch.utils.benchmark as benchmark

    t0 = benchmark.Timer(
        stmt="f(*args, **kwargs)",
        globals={"args": args, "kwargs": kwargs, "f": f},
        num_threads=torch.get_num_threads(),
    )
    return f"{(t0.blocked_autorange().mean):.5f}"


def train(model, inputs, targets, criterion, optimizer):
    model.train()
    optimizer.zero_grad()
    outputs = model(inputs)
    if isinstance(outputs, tuple):
        outputs = outputs[0]  # only use the first tensor as output
    loss = criterion(outputs, targets)
    loss.backward()
    optimizer.step()
    return loss.item()


def inference(model, inputs):
    model.eval()
    with torch.no_grad():
        for _ in range(50):
            outputs = model(inputs)
    return outputs


if __name__ == '__main__':
    print(f"diffusers version: {diffusers.__version__}")
    to_kw = dict(device="cuda", dtype=torch.bfloat16)

    torch.manual_seed(42)
    model_main = CogVideoXCausalConv3d(in_channels=128, out_channels=512, kernel_size=3).to(**to_kw)
    model_main = torch.compile(model_main, mode="max-autotune-no-cudagraphs")
    optimizer_main = torch.optim.Adam(model_main.parameters(), lr=1e-3)

    torch.manual_seed(42)
    model_pr = CogVideoXCausalConv3d_PR(in_channels=128, out_channels=512, kernel_size=3).to(**to_kw)
    model_pr = torch.compile(model_pr, mode="max-autotune-no-cudagraphs")
    optimizer_pr = torch.optim.Adam(model_pr.parameters(), lr=1e-3)

    criterion = nn.MSELoss()

    # test 5 very different shape inputs
    for shape in (
        (16, 128, 64, 64, 64),
        (8, 128, 128, 32, 32),
        (4, 128, 256, 128, 128),
        (4, 128, 512, 256, 256),
        (2, 128, 1024, 512, 512),
    ):
        input = torch.randn(shape, generator=torch.Generator().manual_seed(0)).to(**to_kw)
        output_shape = (shape[0], 512, shape[2], shape[3], shape[4])
        target = torch.randn(output_shape, generator=torch.Generator().manual_seed(0)).to(**to_kw)

        # few iterations of training to warm up the model and compiler
        model_main.train()
        model_pr.train()
        for _ in range(5):
            train(model_main, input, target, criterion, optimizer_main)
            train(model_pr, input, target, criterion, optimizer_pr)

        # benchmark training
        runtime_main = benchmark_fn(train, model_main, input, target, criterion, optimizer_main)
        runtime_pr = benchmark_fn(train, model_pr, input, target, criterion, optimizer_pr)
        print(f"shape {shape} train: main runtime {runtime_main} vs PR runtime {runtime_pr}")

        # few iterations of inference to warm up the model and compiler
        torch.compiler.reset()
        model_main.eval()
        model_pr.eval()
        with torch.inference_mode():
            for _ in range(5):
                inference(model_main, input)
                inference(model_pr, input)
            runtime_main = benchmark_fn(inference, model_main, input)
            runtime_pr = benchmark_fn(inference, model_pr, input)
            print(f"shape {shape} inference: main runtime {runtime_main} vs PR runtime {runtime_pr}")
        print()
```

To measure memory, I added the following:

```python
print(f"MAIN Memory BEFORE conv:", torch.cuda.memory_allocated())
output = self.conv(inputs)
print(f"MAIN Memory AFTER conv:", torch.cuda.memory_allocated())
```

to both the PR and main branch. Memory usage for the first shape:

PR, train:

PR, inference:

Runtimes for the first 2 shapes (NVIDIA H200):

Feel free to run it on your side and correct this benchmark script if you find any issue.
@bm-synth I think there's been some misunderstanding: doing
Thank you @ic-synth for clarifying this. @hlky @yiyixuxu I reverted that change and updated the PR message.
Thank you @hlky, here are the new results with the updated branch (NVIDIA H200).

Script that measures runtimes for 2 shapes:

The other test for the comparison of numerical difference:

Loss for 10 iterations of train for shape (16, 128, 64, 64, 64):
Commits in this PR:

* rewrite memory count without implicitly using dimensions by @ic-synth
* replace F.pad by built-in padding in Conv3D
* in-place sums to reduce memory allocations
* fixed trailing whitespace
* file reformatted
* simpler in-place expressions
* removed in-place sum, may affect backward propagation logic
* reverted change
remove one call to symmetric padding in `F.pad` when running with a non-replicate pad mode, and instead let the padding be done by `Conv3d`, for a more efficient execution;
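A condensed sketch of the constructor logic this describes (a hypothetical `make_conv` helper, not the literal diff): in replicate mode the padding is still applied with `F.pad` in the forward pass, so the convolution keeps `padding=0`; otherwise the convolution zero-pads height and width itself and only the causal time padding remains outside the conv.

```python
import torch.nn as nn


def make_conv(pad_mode: str, in_ch: int = 128, out_ch: int = 512, kernel_size=(3, 3, 3)) -> nn.Conv3d:
    # Hypothetical helper mirroring the padding choice described above: only height/width
    # padding moves into the conv; replicate mode keeps using F.pad in the forward pass.
    _, height_k, width_k = kernel_size
    height_pad = (height_k - 1) // 2
    width_pad = (width_k - 1) // 2
    padding = 0 if pad_mode == "replicate" else (0, height_pad, width_pad)
    return nn.Conv3d(in_ch, out_ch, kernel_size, padding=padding, padding_mode="zeros")


print(make_conv("constant").padding)   # (0, 1, 1)
print(make_conv("replicate").padding)  # (0, 0, 0)
```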